Add parallel-task-set crate, test it, use it #8174
Merged
Conversation
hawkw approved these changes on May 15, 2025
looks good to me, with a couple of nits. there are some places in Nexus we could use this too!
Test failure in |
hawkw added a commit that referenced this pull request on May 20, 2025
Presently, several Nexus background tasks use `tokio::task::JoinSet` to run a large number of Tokio tasks in parallel. In these tasks, we typically set a concurrency limit on the number of spawned tasks using the size of a database query that's used to determine the tasks that should be spawned. We perform the query with a small page size, spawn a group of tasks, and wait for them to complete, in a loop, until the query returns no records. While this is simple to implement, it's not the ideal way to do this, as it will unnecessarily limit the throughput of the spawned tasks. This is because this pattern does not ensure that *exactly* `$CONCURRENCY_LIMIT` tasks are running at a given time; it ensures that *up to* `$CONCURRENCY_LIMIT` tasks are running. Since the database is not queried again to spawn a new batch of tasks until after the *entire* batch of tasks completes, there will always be some period of time during which only a single task is running and all the others have completed. If there's a relatively large variation in how long those tasks take to complete, one slow task can potentially prevent any others from starting for a longish period of time.

An alternative approach, where the tasks are spawned all at once but made to wait on a `tokio::sync::Semaphore` before they actually begin executing, allows us to maximize throughput while limiting concurrency. In this approach, a new task will begin executing immediately as soon as another task finishes, so there are always exactly `$CONCURRENCY_LIMIT` tasks running until the final batch of tasks begins to complete. The `parallel-task-set` crate added in PR #8174 implements a reusable abstraction for this, so this branch updates the `instance_watcher` and `webhook_deliverator` background tasks to use it.
Furthermore, @smklein and I spent some time tweaking the `ParallelTaskSet` API to make it easier to limit not only the number of tasks _executing_ in parallel, but also the number of tasks _resident in memory_ at any given time, by changing the `spawn` method to wait for a previous task to complete if the set is already at the limit. Note that I did *not* change the `instance_updater` background task to use `ParallelTaskSet` in this manner, as all it does is run `instance_update` sagas. Unlike the `instance_watcher` and `webhook_deliverator` background tasks, which make HTTP requests to sled-agents and external webhook endpoints, respectively, this task just spawns sagas and waits for them to finish. So, its spawned tasks aren't actually doing any _work_ besides waiting on a `RunningSaga` future to complete, and the actual work is performed in the saga executor. Concurrency-limiting the actual work would require the concurrency limit to be implemented in the saga executor, not the background task. Also, it's important that all the sagas be _started_ as soon as possible, even if the current Nexus does not execute them, so that they may be picked up by other Nexii. Similarly, the `instance_reincarnation` task also performs a query-spawn-batch-wait type loop, but in that case, it's necessary: the sagas started for each instance in the query perform the state change that evicts it from a subsequent query. Therefore, that task _must_ wait for all sagas in the batch to complete before proceeding.
Follow-up from support bundles work.
This crate exposes a `JoinSet`-like interface which also has a bound on maximum parallelism.